fix: retry transient per-object delete failures in removeObjects() cleanup#1700
fix: retry transient per-object delete failures in removeObjects() cleanup#1700allanrogerr wants to merge 12 commits intominio:masterfrom
Conversation
The removeObjects() cleanup helper silently ignored all DeleteErrors via ignore(r.get()), masking transient server-side failures. When the batch delete of 1050 objects returned partial errors (e.g. SlowDown, InternalError), those objects remained in the bucket and the subsequent removeBucket() call failed with BucketNotEmpty — surfacing a cleanup race as a false test failure. Replace the single ignore() call with two focused helpers: - retryRemoveObject(): retries a single object with exponential backoff (500ms base) up to MAX_DELETE_RETRIES attempts. Accepts NoSuchKey/NoSuchVersion silently (already gone), retries SlowDown/InternalError/RequestTimeout/ServiceUnavailable, and propagates anything else immediately. - removeObjects(): fully consumes the batch-delete iterator first (so all 1000-object HTTP batches are sent), then delegates each DeleteResult.Error to retryRemoveObject() individually. DeleteResult.Error only exposes objectName() via ErrorResponse (no versionId), so a HashMap<String,List<ObjectWriteResponse>> index preserves correct versioned-delete semantics by storing all responses keyed by name and retrying every (name,versionId) pair on a miss.
Switch Result<?> to Result<DeleteResult.Error> to restore compile-time type safety and eliminate the unchecked cast. Drain the full server response before throwing on a non-transient error so OkHttp can reuse the connection slot. Restore the thread interrupt flag if Thread.sleep is interrupted. Wrap TRANSIENT_DELETE_CODES in an unmodifiable set. Guard the testRemoveObjects finally block so the redundant cleanup batch is only sent on failure. Add bucket name to error messages and document the retry-by-name re-queueing trade-off.
There was a problem hiding this comment.
Pull request overview
Improves functional-test cleanup reliability by adding retry logic for transient per-object failures during multi-object delete, to prevent false BucketNotEmpty failures during post-test teardown.
Changes:
- Add transient-error classification and retry constants for multi-object deletion.
- Rework
removeObjects()to retry deletion based onDeleteResult.Errorresponses with exponential backoff. - Avoid double-invoking
removeObjects()intestRemoveObjects()when the first call succeeds.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
MinioAsyncClient.removeObjects() sets completed=true as soon as any 1000-object chunk returns errors, silently dropping all subsequent chunks in that call. When toDelete exceeds 1000 entries and a transient error hits the first chunk, objects in later chunks are never attempted and not carried into the retry set, leaving the bucket non-empty. Fix by iterating toDelete in DELETE_BATCH_SIZE slices and issuing a separate removeObjects() call per slice so each slice is always fully processed regardless of errors in other slices.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
After capturing a non-transient error, the previous code continued issuing removeObjects() calls for all remaining chunks before throwing. Move the nonTransientErr check to after each individual chunk's response is drained so no further delete RPCs are made once a fatal error is seen.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Rename MAX_DELETE_RETRIES to MAX_DELETE_ATTEMPTS to better reflect that the constant bounds attempts, not retries. Include err.message() in the non-transient IOException for richer diagnostics. Track failedNames across iterations and include a 5-object sample in the exhaustion IOException to aid debugging when all retries are consumed.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
If removeObjects() in the try block throws, handleException() rethrows it. A second throw from the finally cleanup would then mask the original exception, making the test failure harder to diagnose. Wrap the cleanup call in try/catch so the original exception propagates unmasked.
The cleanup catch block in the finally clause intentionally discards the exception so the original test failure propagates unmasked. Suppress the DE_MIGHT_IGNORE finding at the method level.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…ailure Replace byName index, retryNames set, and toDelete rebuild loop with a single anyTransient boolean flag. Retry the full object list on transient errors rather than narrowing to failing names — already-deleted objects are silently ignored by S3 and MinioAsyncClient filters NoSuchVersion.
6c2be27 to
039724d
Compare
|
@balamurugana PTAL |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Implements SDK-level automatic retry matching the minio-go retry-mechanism spec. All requests through BaseS3Client.executeAsync() now retry on retryable S3 error codes (InternalError, SlowDown, RequestTimeout, etc.), retryable HTTP status codes (408, 429, 499, 500, 502, 503, 504, 520), and transient IO errors (connection reset, EOF, server closed idle connection). Non-seekable request bodies (raw okhttp3.RequestBody) get a single attempt; seekable bodies (byte[], ByteBuffer, RandomAccessFile) retry up to maxRetries (default 10). Backoff is full-jitter exponential: rand(0, min(1s, 200ms*2^n)) before the nth retry, matching minio-go's DefaultRetryUnit/DefaultRetryCap. The "RetryHead" S3 code is explicitly excluded so executeHeadAsync region redirect logic is unaffected. maxRetries is configurable via builder and setMaxRetries() post-construction. Closes minio#1700.
Address review feedback on PR minio#1700: - Use Exception instead of IOException in removeObjects() helper for consistency with the rest of the test class. - Remove the finally/succeeded retry in testRemoveObjects() now that removeObjects() retries transient failures internally; drops the associated DE_MIGHT_IGNORE SpotBugs suppression.
Problem
The
removeObjects()cleanup helper inTestMinioClient.javasilently discarded all deletion errors viaignore(r.get()). On transient server-side errors (e.g.SlowDown,InternalError) during a batch delete of 1050 objects, the failing objects remain in the bucket. The subsequentremoveBucket()call then returns409 BucketNotEmpty, causing the test to log a false FAIL from the cleanup path — even when the test logic itself passed.This failure was observed in 5 distinct CI runs across the week ending 2026-04-24, always in
testListObjectswith 1050 objects, always as the sole minio-java failure while 13 of 14 other mint suites passed. Tracked in miniohq/eos#4608.CI evidence (all show the same
BucketNotEmptyonlistObjects()/ 1050 objects):Repeated error fragment across all logs: